Abstract



This note introduces Venn-Abers predictors, a new class of Venn predictors
based on the idea of isotonic regression. Like all Venn predictors,
Venn-Abers predictors are well calibrated under the exchangeability
assumption.



1 Introduction 

This note is prompted by [2], which demonstrates that the probability fore- 
casting procedure introduced by Zadrozny and Elkan in [?] (an adaptation of 
the isotonic regression procedure of [1]) can be poorly calibrated, whereas Venn 
predictors ([3], Chapter 6) are always well calibrated in their experiments and, 
moreover, are guaranteed to be well calibrated under the exchangeability as- 
sumption. This note shows that a simple modification of Zadrozny and Elkan's 
procedure is also a Venn predictor and so overcomes the problem of poten- 
tial poor calibration. (The modified procedure, however, is a multiprobability 
predictor.) 

2 Venn-Abers predictors 

We consider examples $z = (x, y)$ consisting of two components: an object
$x \in X$ and its label $y \in Y$. In this note we are only interested in the
binary case and for concreteness set $Y := \{0, 1\}$. We will use the notation
$\lbag a_1, \ldots, a_n \rbag$ for bags (in other words, multisets); the
cardinality of the set $\{a_1, \ldots, a_n\}$ might well be smaller than $n$
(because of the removal of all duplicates in the bag). As usual, a "training
set" is a bag of examples rather than a set. We say that a function $f$ is
increasing if its domain is an ordered set and
$t_1 \le t_2 \Rightarrow f(t_1) \le f(t_2)$.
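
The distinction between a bag and a set can be illustrated with a small
sketch. The variable names below are hypothetical and serve only as an
illustration; `collections.Counter` is used here as one convenient way to
represent a bag in Python.

```python
from collections import Counter

# A bag (multiset) of labels: duplicates are kept and counted.
labels = [1, 0, 1, 1, 0]
bag = Counter(labels)

bag_size = sum(bag.values())  # number of elements in the bag, n
set_size = len(set(labels))   # cardinality of the underlying set

print(bag_size, set_size)
```

Here the bag has five elements while the underlying set has only two, which
is the situation described in the parenthetical remark above.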

Many machine-learning algorithms for classification are in fact scoring
algorithms: when trained on a training set of examples and fed with a test
object $x$, they output a prediction score $s(x)$; we will call
$s : X \to \mathbb{R}$ the scoring function for that training set. The actual
classification algorithm is obtained by fixing a threshold $c$ and predicting
the label of $x$ to be 1 if and only if $s(x) \ge c$ (or if and only if
$s(x) > c$). Alternatively, one could apply an increasing function $g$ to
$s(x)$ in an attempt to "calibrate" the scores, so that $g(s(x))$ can be used
as the predicted probability that the label of $x$ is 1.

Fix a scoring algorithm and let $\lbag z_1, \ldots, z_l \rbag$ be a training
set of examples $z_i = (x_i, y_i)$, $i = 1, \ldots, l$. The most direct
application [3] of the method of isotonic regression [1] to the problem of
score calibration is as follows. Train the scoring algorithm on the training
set and compute the score $s(x_i)$ for each training example $(x_i, y_i)$,
where $s$ is the scoring function for $\lbag z_1, \ldots, z_l \rbag$. Let $g$
be the increasing function on the set $\{s(x_1), \ldots, s(x_l)\}$ that
maximizes the likelihood

$$\prod_{i=1}^{l} p_i, \quad \text{where } p_i :=
\begin{cases} g(s(x_i)) & \text{if } y_i = 1, \\
1 - g(s(x_i)) & \text{if } y_i = 0. \end{cases} \tag{1}$$

Such a function $g$ is indeed unique ([1], Corollary 2.1) and can be easily
found using the "pair-adjacent violators algorithm" (PAVA, described in detail
in the summary of [1] and, in a special case, in [3]; see also the proof of
Lemma 1 below). We will say that $g$ is the isotonic calibrator for
$\lbag (s(x_1), y_1), \ldots, (s(x_l), y_l) \rbag$. To predict the label of a
test object $x$, the direct procedure finds the closest $s(x_i)$ to $s(x)$ and
outputs $g(s(x_i))$ as its prediction (we do not go into details such as
breaking the ties or the possibility of interpolation).
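
The direct procedure can be sketched in a few lines of Python. This is a
minimal illustration, not the implementation of [1] or [3]: the function names
`isotonic_calibrator` and `direct_predict` are hypothetical, the scores are
assumed to be already computed, the PAVA is written in a stack-based form, and
ties between equally close scores are broken in favour of the smaller score
(an arbitrary choice, since the text leaves tie-breaking open).

```python
from fractions import Fraction

def isotonic_calibrator(pairs):
    """Map each distinct score t to g(t) for a bag of (score, label) pairs."""
    # Count, for each distinct score, the labels equal to 1 (a) and all labels (n).
    counts = {}
    for t, y in pairs:
        a, n = counts.get(t, (0, 0))
        counts[t] = (a + y, n + 1)
    # Cells are [scores, a, N], kept in increasing order of score.
    cells = []
    for t in sorted(counts):
        cells.append([[t], *counts[t]])
        # Merge while the left neighbour's value a1/N1 exceeds the right one a2/N2.
        while len(cells) > 1 and cells[-2][1] * cells[-1][2] > cells[-1][1] * cells[-2][2]:
            ts, a, n = cells.pop()
            cells[-1][0] += ts
            cells[-1][1] += a
            cells[-1][2] += n
    return {t: Fraction(a, n) for ts, a, n in cells for t in ts}

def direct_predict(g, score):
    """Output g at the training score closest to the given test score."""
    nearest = min(g, key=lambda t: abs(t - score))
    return g[nearest]

pairs = [(0.1, 0), (0.2, 1), (0.3, 0), (0.4, 1), (0.4, 1)]
g = isotonic_calibrator(pairs)
prediction = direct_predict(g, 0.33)
```

In this toy bag the scores 0.2 and 0.3 are violators and end up in one cell
with value 1/2, so a test score of 0.33 receives the prediction 1/2.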

The direct procedure is prone to overfitting as the same examples
$z_1, \ldots, z_l$ are used both for training the scoring algorithm and for
calibration without taking any precautions. The Venn-Abers predictor is the
multiprobability predictor that is defined as follows. Try the two different
classifications, 0 and 1, for the test object $x$. Let $s_0$ be the scoring
function for $\lbag z_1, \ldots, z_l, (x, 0) \rbag$, $s_1$ be the scoring
function for $\lbag z_1, \ldots, z_l, (x, 1) \rbag$, $g_0$ be the isotonic
calibrator for
$\lbag (s_0(x_1), y_1), \ldots, (s_0(x_l), y_l), (s_0(x), 0) \rbag$, and $g_1$
be the isotonic calibrator for
$\lbag (s_1(x_1), y_1), \ldots, (s_1(x_l), y_l), (s_1(x), 1) \rbag$. The
multiprobability prediction output by the Venn-Abers predictor is
$\{p_0, p_1\}$, where $p_0 := g_0(s_0(x))$ and $p_1 := g_1(s_1(x))$. (And we
can expect $p_0$ and $p_1$ to be close to each other unless the direct
procedure overfits grossly.)
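
The definition can be traced step by step in code. The sketch below is an
illustration under loud assumptions: `train_scorer` is a hypothetical toy
scoring algorithm for one-dimensional objects (it is not taken from the
note and requires both labels to be present in its training bag), and
`isotonic_calibrator` is a stack-based PAVA written for this example.

```python
from fractions import Fraction

def isotonic_calibrator(pairs):
    """Map each distinct score t to g(t) for a bag of (score, label) pairs."""
    counts = {}
    for t, y in pairs:
        a, n = counts.get(t, (0, 0))
        counts[t] = (a + y, n + 1)
    cells = []  # each cell is [scores, a, N], in increasing order of score
    for t in sorted(counts):
        cells.append([[t], *counts[t]])
        while len(cells) > 1 and cells[-2][1] * cells[-1][2] > cells[-1][1] * cells[-2][2]:
            ts, a, n = cells.pop()
            cells[-1][0] += ts
            cells[-1][1] += a
            cells[-1][2] += n
    return {t: Fraction(a, n) for ts, a, n in cells for t in ts}

def train_scorer(examples):
    """Hypothetical toy scoring algorithm: squared distance to the class-0 mean
    minus squared distance to the class-1 mean (assumes both classes occur)."""
    def mean(label):
        xs = [x for x, y in examples if y == label]
        return sum(xs) / len(xs)
    m0, m1 = mean(0), mean(1)
    return lambda x: (x - m0) ** 2 - (x - m1) ** 2

def venn_abers(train, x):
    """Return (p0, p1): retrain and recalibrate once for each tentative label."""
    probs = []
    for y in (0, 1):
        augmented = train + [(x, y)]        # training bag plus (x, y)
        s = train_scorer(augmented)         # s_y
        g = isotonic_calibrator([(s(xi), yi) for xi, yi in augmented])  # g_y
        probs.append(g[s(x)])               # p_y := g_y(s_y(x))
    return tuple(probs)

train = [(0.0, 0), (1.0, 0), (2.0, 1), (2.5, 0), (3.0, 1), (4.0, 1)]
p0, p1 = venn_abers(train, 2.2)
```

Note that the scoring algorithm is retrained for each of the two tentative
labels of every test object, which is where the computational cost of the
general definition comes from.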

In general, Venn-Abers predictors are computationally inefficient, especially
if we would like to apply them to a large number of test examples and the same
training set. More computationally efficient pre-trained Venn-Abers predictors
are defined as follows. The training set $\lbag z_1, \ldots, z_l \rbag$ is
split into two parts: the proper training set $\lbag z_1, \ldots, z_m \rbag$
of size $m < l$ and the calibration set $\lbag z_{m+1}, \ldots, z_l \rbag$ of
size $l - m$. Let $s : X \to \mathbb{R}$ be the scoring function for
$\lbag z_1, \ldots, z_m \rbag$, $g_0$ be the isotonic calibrator for
$\lbag (s(x_{m+1}), y_{m+1}), \ldots, (s(x_l), y_l), (s(x), 0) \rbag$, and
$g_1$ be the isotonic calibrator for
$\lbag (s(x_{m+1}), y_{m+1}), \ldots, (s(x_l), y_l), (s(x), 1) \rbag$. The
multiprobability prediction output by the pre-trained Venn-Abers predictor is
$\{p_0, p_1\}$, where $p_0 := g_0(s(x))$ and $p_1 := g_1(s(x))$. (This
definition is in the spirit of inductive conformal predictors [3],
Section 4.1, but we avoid using the term "inductive Venn-Abers predictors"
since our pre-trained Venn-Abers predictors are not inductive Venn predictors
in the sense of [2], Section 3.1.)
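
A minimal sketch of the pre-trained variant follows. It assumes that the
scoring function `s` has already been obtained from the proper training set;
here it is simply the identity on one-dimensional objects, a hypothetical
stand-in. The names `isotonic_calibrator` and `pretrained_venn_abers` are
likewise illustrative.

```python
from fractions import Fraction

def isotonic_calibrator(pairs):
    """Map each distinct score t to g(t) for a bag of (score, label) pairs."""
    counts = {}
    for t, y in pairs:
        a, n = counts.get(t, (0, 0))
        counts[t] = (a + y, n + 1)
    cells = []  # each cell is [scores, a, N], in increasing order of score
    for t in sorted(counts):
        cells.append([[t], *counts[t]])
        while len(cells) > 1 and cells[-2][1] * cells[-1][2] > cells[-1][1] * cells[-2][2]:
            ts, a, n = cells.pop()
            cells[-1][0] += ts
            cells[-1][1] += a
            cells[-1][2] += n
    return {t: Fraction(a, n) for ts, a, n in cells for t in ts}

def pretrained_venn_abers(s, calibration, x):
    """Return (p0, p1) for test object x, using a fixed scoring function s."""
    base = [(s(xi), yi) for xi, yi in calibration]   # calibration scores
    sx = s(x)
    p0 = isotonic_calibrator(base + [(sx, 0)])[sx]   # g_0(s(x))
    p1 = isotonic_calibrator(base + [(sx, 1)])[sx]   # g_1(s(x))
    return p0, p1

# Hypothetical scoring function, assumed trained on the proper training set.
s = lambda x: x
calibration = [(0.1, 0), (0.3, 0), (0.4, 1), (0.6, 0), (0.7, 1), (0.9, 1)]
p0, p1 = pretrained_venn_abers(s, calibration, 0.5)
```

Because `s` is fixed, only the two isotonic calibrators have to be recomputed
for each test object; the scoring algorithm itself is never retrained.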

Venn predictors are defined as in [3], Chapter 6, except that a probability
distribution $P$ on the set $\{0, 1\}$ is now represented by the number
$P(\{1\}) \in [0, 1]$.

Proposition 1. Venn-Abers predictors are Venn predictors. Pre-trained
Venn-Abers predictors are Venn predictors when considered as functions of
$\lbag z_{m+1}, \ldots, z_l \rbag$.

Proof. Fix a Venn-Abers predictor. The corresponding taxonomy is defined as
follows: assign $(\lbag z_1, \ldots, z_n \rbag, (x, y))$ and
$(\lbag z'_1, \ldots, z'_{n'} \rbag, (x', y'))$ to the same cell if and only
if $g(s(x)) = g'(s'(x'))$, where $s$ is the scoring function for
$\lbag z_1, \ldots, z_n, (x, y) \rbag$, $s'$ is the scoring function for
$\lbag z'_1, \ldots, z'_{n'}, (x', y') \rbag$, $g$ is the isotonic calibrator
for $\lbag (s(x_1), y_1), \ldots, (s(x_n), y_n), (s(x), y) \rbag$, and $g'$ is
the isotonic calibrator for
$\lbag (s'(x'_1), y'_1), \ldots, (s'(x'_{n'}), y'_{n'}), (s'(x'), y') \rbag$.
Lemma 1 below shows that the Venn predictor corresponding to this taxonomy
gives predictions identical to those given by the original Venn-Abers
predictor. This proves the first statement of the proposition.

The second statement follows from the fact that for a fixed bag
$\lbag z_1, \ldots, z_m \rbag$ the pre-trained Venn-Abers predictor is the
Venn-Abers predictor corresponding to a scoring function $s_0 = s_1 = s$ that
does not depend on the data $\lbag z_{m+1}, \ldots, z_l \rbag$ at all. □

Lemma 1. Let $g$ be the isotonic calibrator for
$\lbag (t_1, y_1), \ldots, (t_n, y_n) \rbag$, where $t_i \in \mathbb{R}$ and
$y_i \in \{0, 1\}$, $i = 1, \ldots, n$. Any
$p \in \{g(t_1), \ldots, g(t_n)\}$ is equal to the arithmetic mean of the
labels $y_i$ of the $t_i$, $i = 1, \ldots, n$, satisfying $g(t_i) = p$.
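
As a small worked instance of the lemma (the bag below is a made-up example,
not taken from the note), take four pairs with distinct $t_i$. The likelihood
is $(1 - g(1))\, g(2)\, (1 - g(3))\, g(4)$, which under the constraint
$g(1) \le g(2) \le g(3) \le g(4)$ is maximized as shown.

```latex
\[
  \lbag (1, 0),\ (2, 1),\ (3, 0),\ (4, 1) \rbag
  \quad\Longrightarrow\quad
  g(1) = 0, \qquad g(2) = g(3) = \tfrac{1}{2}, \qquad g(4) = 1 .
\]
\[
  p = \tfrac{1}{2}: \quad
  \{\, t_i : g(t_i) = \tfrac{1}{2} \,\} = \{2, 3\}, \qquad
  \frac{y_2 + y_3}{2} = \frac{1 + 0}{2} = \tfrac{1}{2} = p .
\]
```

The values $p = 0$ and $p = 1$ are each attained at a single $t_i$, whose
label is 0 and 1 respectively, so the lemma holds for them trivially.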

Proof. The statement of the lemma immediately follows from the definition of
the PAVA ([1], summary), which we will reproduce here. Arrange the numbers
$t_i$ in the strictly increasing order $t_{(1)} < \cdots < t_{(k)}$, where
$k \le n$ is the number of distinct elements among $t_i$. We would like to
find the increasing function $g$ on the set
$\{t_{(1)}, \ldots, t_{(k)}\} = \{t_1, \ldots, t_n\}$ maximizing the
likelihood (defined by (1) with $t_i$ in place of $s(x_i)$ and $n$ in place of
$l$). The procedure is recursive.
At each step the set $\{t_{(1)}, \ldots, t_{(k)}\}$ is partitioned into a
number of disjoint cells consisting of adjacent elements of the set; to each
cell is assigned a ratio $a/N$ (formally, a pair of integers, with $a \ge 0$
and $N > 0$); the function $g$ defined at this step (perhaps to be redefined
at the following steps) is constant on each cell. For $j = 1, \ldots, k$, let
$a_j$ be the number of $i$ such that $y_i = 1$ and $t_i = t_{(j)}$, and let
$N_j$ be the number of $i$ such that $t_i = t_{(j)}$. Start from the partition
of $\{t_{(1)}, \ldots, t_{(k)}\}$ into one-element cells, assign the ratio
$a_j/N_j$ to $\{t_{(j)}\}$, and set

$$g(t_{(j)}) := \frac{a_j}{N_j} \tag{2}$$

(in the notation used in this proof, $a/N$ is a pair of integers whereas
$\frac{a}{N}$ is a rational number, the result of the division). If the
function $g$ is increasing, we are done. If not, there is a pair $C_1, C_2$ of
adjacent cells ("violators") such that $C_1$ is to the left of $C_2$ and
$g(C_1) > g(C_2)$ (where $g(C)$ stands for the common value of $g(t_{(j)})$
for $t_{(j)} \in C$); in this case redefine the partition by merging $C_1$ and
$C_2$ into one cell $C$, assigning the ratio $(a_1 + a_2)/(N_1 + N_2)$ to $C$,
where $a_1/N_1$ and $a_2/N_2$ are the ratios assigned to $C_1$ and $C_2$,
respectively, and setting

$$g(t_{(j)}) := \frac{a_1 + a_2}{N_1 + N_2} \tag{3}$$

for all $t_{(j)} \in C$. Repeat the process until $g$ becomes increasing (the
number of cells decreases by 1 at each iteration, so the process will
terminate in at most $k$ steps). The final function $g$ is the one that
maximizes the likelihood. The statement of the lemma follows from this
recursive definition: it is true by definition for the initial function and
remains true when $g$ is redefined by (3). □
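
The recursive procedure in the proof can be transcribed almost literally. The
sketch below is an illustration only: the name `pava_cells` and the choice to
always merge the leftmost pair of violators are assumptions made for this
example, and each cell is stored as a triple (elements, a, N) so that the
ratio a/N stays a pair of integers until it is divided.

```python
from fractions import Fraction

def pava_cells(pairs):
    """Return the final partition as a list of (elements, a, N) cells."""
    distinct = sorted({t for t, _ in pairs})
    # Initial partition: one-element cells with ratios a_j / N_j.
    cells = [([t],
              sum(y for u, y in pairs if u == t),   # a_j: labels equal to 1
              sum(1 for u, _ in pairs if u == t))   # N_j: all labels
             for t in distinct]
    while True:
        for j in range(len(cells) - 1):
            (ts1, a1, n1), (ts2, a2, n2) = cells[j], cells[j + 1]
            if a1 * n2 > a2 * n1:                   # violators: g(C1) > g(C2)
                # Merge C1 and C2 into C with ratio (a1 + a2) / (N1 + N2).
                cells[j:j + 2] = [(ts1 + ts2, a1 + a2, n1 + n2)]
                break
        else:
            return cells                            # g is increasing: done

pairs = [(1, 0), (2, 1), (3, 0), (4, 1), (4, 1)]
cells = pava_cells(pairs)
g = {t: Fraction(a, n) for ts, a, n in cells for t in ts}

# Lemma 1: on each final cell, g equals the mean of the labels in that cell.
lemma_holds = all(
    Fraction(sum(y for t, y in pairs if t in ts),
             sum(1 for t, _ in pairs if t in ts)) == Fraction(a, n)
    for ts, a, n in cells
)
```

Each pass either merges one pair of adjacent cells or finds none and stops, so
the loop runs at most as many times as there are distinct values among the
first components.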

3 Conclusion 

This note has introduced a new class of Venn predictors, thereby extending the
domain of applicability of the method.